A statistical approach to calibrating the scores of biased reviewers of scientific papers

Abstract

Peer reviewing is the key ingredient in evaluating the quality of scientific work. Based on the review scores that individual reviewers assign to papers, conference program committees and journal editors decide which papers to accept for publication and which to reject. Similar procedures are used to select grant applications and in other fields, such as sports. It is well known that the reviewing process suffers from measurement errors due to a lack of agreement among multiple reviewers of the same paper. Moreover, if not all papers are reviewed by all reviewers, the naive approach of averaging the scores is biased. Several statistical methods are proposed for aggregating review scores, all of which can be implemented with standard statistical software. The simplest method uses the well-known fixed-effects two-way classification with identical variances, while a more advanced method assumes different variances. As alternatives, a mixed linear model and a generalized linear model are employed. Applying these methods also yields an evaluation of the reviewers, which may help to improve reviewing processes. An application example with real conference data shows the potential of these statistical methods.
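To illustrate why plain averaging can mislead when each paper is seen by only some reviewers, the following minimal sketch fits a two-way main-effects model, score = paper quality + reviewer bias, by least squares. It is not the authors' implementation: the abstract says only that standard statistical software is used, whereas this sketch substitutes a simple alternating-averaging (backfitting) loop. The function names, the zero-mean constraint on the biases, and the noise-free toy scores are all assumptions made for illustration.

```python
from collections import defaultdict

def calibrate(reviews, iterations=200):
    """Fit score = quality[paper] + bias[reviewer] by least squares.

    reviews: list of (paper, reviewer, score) triples (hypothetical data).
    Returns (quality, bias) dicts; biases are shifted to average zero.
    """
    by_paper = defaultdict(list)
    by_reviewer = defaultdict(list)
    for paper, reviewer, score in reviews:
        by_paper[paper].append((reviewer, score))
        by_reviewer[reviewer].append((paper, score))

    quality = {p: 0.0 for p in by_paper}
    bias = {r: 0.0 for r in by_reviewer}
    for _ in range(iterations):
        # Best paper effects given the current reviewer biases.
        for p, items in by_paper.items():
            quality[p] = sum(s - bias[r] for r, s in items) / len(items)
        # Best reviewer biases given the current paper effects.
        for r, items in by_reviewer.items():
            bias[r] = sum(s - quality[p] for p, s in items) / len(items)

    # Identifiability (assumed convention): biases average to zero.
    shift = sum(bias.values()) / len(bias)
    bias = {r: b - shift for r, b in bias.items()}
    quality = {p: q + shift for p, q in quality.items()}
    return quality, bias

def naive_means(reviews):
    """Average the raw scores of each paper, ignoring who reviewed it."""
    scores = defaultdict(list)
    for paper, _, score in reviews:
        scores[paper].append(score)
    return {p: sum(v) / len(v) for p, v in scores.items()}

# Toy data: r1 is harsh, r2 is lenient, r3 is neutral; no noise.
reviews = [
    ("A", "r1", 4.0), ("A", "r3", 6.0),
    ("B", "r2", 7.0), ("B", "r3", 5.0),
    ("C", "r1", 5.0), ("C", "r2", 9.0),
]
quality, bias = calibrate(reviews)
naive = naive_means(reviews)
print("naive:", naive)
print("calibrated:", quality)
print("reviewer bias:", bias)
```

In this toy design the naive averages place paper B above paper A, because B happened to draw the lenient reviewer and A the harsh one, while the calibrated paper effects reverse that order. The estimated reviewer biases are the by-product the abstract refers to as an evaluation of the reviewers. The sketch assumes the review assignment is connected (every paper is linked to every other through shared reviewers); otherwise the effects are not comparable across groups.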

Notes

  1. The goodness-of-fit measure AIC depends on the maximized log-likelihood and on a penalty that grows with the number of estimated parameters. A model with a lower AIC should be preferred.
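
For reference, the standard definition of the AIC can be written as follows. The symbols are assumptions, since the note itself introduces no notation: k denotes the number of estimated parameters and \hat{L} the maximized value of the likelihood.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

The first term is the penalty and the second rewards fit, so adding a parameter lowers the AIC only if it raises the maximized log-likelihood by more than one.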

Acknowledgments

The authors cordially thank an anonymous referee for valuable constructive comments on an earlier version of the paper and Mayer Alvo for information on the use of model (2) in statistical ranking. This work was supported in part by DFG grant RO 1202/15-1.

Author information

Corresponding author

Correspondence to Magnus Roos.

Additional information

Joachim Rudolph: Deceased on November 17, 2014.

About this article

Cite this article

Kuhlisch, W., Roos, M., Rothe, J. et al. A statistical approach to calibrating the scores of biased reviewers of scientific papers. Metrika 79, 37–57 (2016). https://doi.org/10.1007/s00184-015-0542-z

Keywords

  • Analysis of variance
  • Model with main effects
  • Design matrix
  • Peer reviewing
  • Review scores

Mathematics Subject Classification

  • 62J10
  • 91B82