
A statistical approach to calibrating the scores of biased reviewers of scientific papers


Abstract

Peer review is the key ingredient in evaluating the quality of scientific work. Based on the review scores that individual reviewers assign to papers, conference program committees and journal editors decide which papers to accept for publication and which to reject. A similar procedure is part of the selection process for grant applications and is also used in other fields such as sports. It is well known that the reviewing process suffers from measurement errors due to a lack of agreement among multiple reviewers of the same paper, and if not every paper is reviewed by every reviewer, the naive approach of averaging the scores is biased. Several statistical methods are proposed for aggregating review scores, all of which can be implemented with standard statistical software. The simplest method uses the well-known fixed-effects two-way classification with identical variances, while a more advanced method allows the variances to differ. As alternatives, a mixed linear model and a generalized linear model are employed. Applying these methods also yields an evaluation of the reviewers, which may help to improve reviewing processes. An application to real conference data shows the potential of these statistical methods.
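
The simplest of the aggregation methods mentioned above, a fixed-effects two-way classification of the scores by paper and by reviewer, can be carried out with standard statistical software. The sketch below is a minimal illustration of that idea, not the authors' implementation; the toy data, the column names score, paper and reviewer, and the use of Python's statsmodels package are assumptions made only for this example.

```python
# Minimal sketch (illustrative only): fit a fixed-effects two-way
# classification score ~ paper + reviewer so that each paper's score is
# adjusted for the severity of the reviewers who happened to rate it.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical incomplete review matrix: not every reviewer scores every paper.
reviews = pd.DataFrame({
    "paper":    ["p1", "p1", "p2", "p2", "p3", "p3"],
    "reviewer": ["r1", "r2", "r2", "r3", "r1", "r3"],
    "score":    [6.0, 4.0, 5.0, 3.0, 7.0, 4.0],
})

# Two-way fixed-effects model with a common error variance.
fit = smf.ols("score ~ C(paper) + C(reviewer)", data=reviews).fit()

# Calibrated score of each paper: average the model's predictions over all
# reviewers, which removes the bias caused by the uneven assignment.
papers = reviews["paper"].unique()
reviewers = reviews["reviewer"].unique()
grid = pd.DataFrame(
    [(p, r) for p in papers for r in reviewers], columns=["paper", "reviewer"]
)
grid["adjusted"] = fit.predict(grid)
print(grid.groupby("paper")["adjusted"].mean())
```

Under the model's assumptions, these adjusted means are comparable across papers even though each paper was seen by a different subset of reviewers, unlike the raw per-paper averages.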


Notes

  1. The goodness-of-fit measure AIC combines the maximized log-likelihood with a penalty that increases with the number of estimated parameters; among competing models, the one with the lower AIC should be preferred (see the illustration below).
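
For concreteness, the usual definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The toy computation below (illustrative numbers only, not taken from the paper) shows how the parameter penalty can outweigh a gain in likelihood.

```python
# Toy illustration of the AIC trade-off (numbers chosen for illustration only).
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2*k - 2*ln(L_hat)."""
    return 2 * n_params - 2 * log_likelihood

simple_model = aic(log_likelihood=-120.0, n_params=5)   # 250.0
richer_model = aic(log_likelihood=-118.0, n_params=10)  # 256.0
# The richer model fits slightly better (higher log-likelihood), but the
# penalty for its extra parameters outweighs that gain, so the simpler
# model has the lower AIC and would be preferred.
print(simple_model, richer_model)
```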


Acknowledgments

The authors cordially thank an anonymous referee for valuable constructive comments on an earlier version of the paper and Mayer Alvo for information on the use of model (2) in statistical ranking. This work was supported in part by DFG grant RO 1202/15-1.

Author information


Corresponding author

Correspondence to Magnus Roos.

Additional information

Joachim Rudolph: Deceased on November 17, 2014.


About this article


Cite this article

Kuhlisch, W., Roos, M., Rothe, J. et al. A statistical approach to calibrating the scores of biased reviewers of scientific papers. Metrika 79, 37–57 (2016). https://doi.org/10.1007/s00184-015-0542-z

